Corpus Selection Approaches for Multilingual Parsing from Raw Text to Universal Dependencies

نویسندگان

  • Ryan Hornby
  • Clark Taylor
  • Jungyeul Park
چکیده

This paper describes UALing’s approach to the CoNLL 2017 UD Shared Task using corpus selection techniques to reduce training data size. The methodology is simple: We use similarity measures to select a corpus from available training data (even from multiple corpora for surprise languages) and use the resulting corpus to complete the parsing task. The training and parsing is done with the baseline UDPipe system (Straka et al., 2016). While our approach reduces the size of training data significantly, it retains performance within 0.5% of the baseline system. Due to the reduction in training data size, our system performs faster than the naı̈ve, complete corpus method. Specifically, our system runs in less than 10 minutes, ranking it among the fastest entries for this task. Our system is available at https://github. com/CoNLL-UD-2017/UALING.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

RACAI's Natural Language Processing pipeline for Universal Dependencies

This paper presents RACAI’s approach, experiments and results at CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. We handle raw text and we cover tokenization, sentence splitting, word segmentation, tagging, lemmatization and parsing. All results are reported under strict training, development and testing conditions, in which the corpora provided for the sha...

متن کامل

A System for Multilingual Dependency Parsing based on Bidirectional LSTM Feature Representations

In this paper, we present our multilingual dependency parser developed for the CoNLL 2017 UD Shared Task dealing with “Multilingual Parsing from Raw Text to Universal Dependencies”1. Our parser extends the monolingual BIST-parser as a multi-source multilingual trainable parser. Thanks to multilingual word embeddings and one hot encodings for languages, our system can use both monolingual and mu...

متن کامل

A Semi-universal Pipelined Approach to the CoNLL 2017 UD Shared Task

This paper presents the TRL team’s system submitted for the CoNLL 2017 Shared Task, “Multilingual Parsing from Raw Text to Universal Dependencies.” We ran the system for all languages with our own fully pipelined components without relying on either pre-trained baseline or machine learning techniques. We used only the universal part-of-speech tags and distance between words, and applied determi...

متن کامل

A Transition-based System for Universal Dependency Parsing

This paper describes the system for our participation of team Wanghao-ftd-SJTU in the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In this work, we design a system based on UDPipe1 for universal dependency parsing, where transitionbased models are trained for different treebanks. Our system directly takes raw texts as input, performing several intermedia...

متن کامل

A Fast and Lightweight System for Multilingual Dependency Parsing

Following Kiperwasser and Goldberg (2016), we present a multilingual dependency parser with a bidirectionalLSTM (BiLSTM) feature extractor and a multi-layer perceptron (MLP) classifier. We trained our transition-based projective parser in UD version 2.0 datasets without any additional data. The parser is fast, lightweight and effective on big treebanks. In the CoNLL 2017 Shared Task: Multilingu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017